Intern Insight

A Student Internship Admissions Portal

Author

Amelia Baier, Andrea Dukic, Mia Mayerhofer

Welcome to Intern Insight, Data Tech College’s Dedicated Student Internship Admissions Portal!


The goal of this analytics capability is to provide effective insights into the internship activity of Data Tech College students each year and foster data-driven action to help advance the careers of the students.

The portal is designed for interactivity and features customizable visuals and toggle tooltips for an enhanced user experience.

About the Data

The dataset contains summer internship results for 80 students who attend Data Tech College. It includes academic attributes such as each student’s GPA, test score, and writing score; holistic attributes such as volunteer level and years of work experience; and demographic information such as state and gender.

The original internship admissions data set contained outliers such as erroneous GPA and demographic values, which were subsequently removed during data preprocessing. All visualizations and results presented below are based on the cleaned data set, which excludes the outlier rows.

At A Glance - Summer 2023

Mean Trends by Metric

Below is an interactive bar plot displaying the mean of each numerical metric (GPA, Test Score, etc.) by internship admissions decision. To switch between metrics, open the dropdown menu and select the metric of your choice.

Code
# Import the packages used in this cell (later cells import what they need)
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio

# Read in clean data
df = pd.read_csv("../data/clean_data.csv")

# Get mean and median dfs
means = df[["Decision", "GPA", "WorkExp", "TestScore", "WritingScore", "VolunteerLevel"]].groupby("Decision").agg(["mean"]).reset_index()
means.columns = means.columns.droplevel(1)
means.columns = ["Decision", "GPA", "Work Experience", "Test Score", "Writing Score", "Volunteer Level"]
medians = df[["Decision", "GPA", "WorkExp", "TestScore", "WritingScore", "VolunteerLevel"]].groupby("Decision").agg(["median"]).reset_index()
medians.columns = medians.columns.droplevel(1)
medians.columns = ["Decision", "GPA", "Work Experience", "Test Score", "Writing Score", "Volunteer Level"]

# Plotly visual
variables = ["GPA", "Test Score", "Writing Score", "Work Experience", "Volunteer Level"]
order = ["Admit", "Waitlist", "Decline"]

pio.renderers.default = "plotly_mimetype+notebook"

# Add traces
plot = go.Figure(data=[
    go.Bar( 
        name = "GPA", 
        x = means["Decision"], 
        y = means["GPA"],
        marker_color = "#0E6BA8"
    ), 
    go.Bar( 
        name = "Test Score", 
        x = means["Decision"], 
        y = means["Test Score"],
        marker_color = "#6F0624",
        visible = False 
    ), 
    go.Bar( 
        name = "Writing Score", 
        x = means["Decision"], 
        y = means["Writing Score"],
        marker_color = "#8B748F",
        visible = False  
    ), 
    go.Bar( 
        name = "Work Experience", 
        x = means["Decision"], 
        y = means["Work Experience"],
        marker_color = "#00072D",
        visible = False 
    ), 
    go.Bar( 
        name = "Volunteer Level", 
        x = means["Decision"], 
        y = means["Volunteer Level"],
        marker_color = "#0A2472",
        visible = False
    ) 
]) 

# List of titles to use
titles = ["Mean GPA", "Mean Test Score", "Mean Writing Score", "Mean Years of Work Experience", "Mean Volunteer Level"]

# Dropdown: each button shows exactly one trace and updates the titles.
# Button args are passed straight to plotly.js, so layout keys use dotted
# paths ("yaxis.title.text") rather than Python-only shortcuts ("yaxis_title").
plot.update_layout(
    updatemenus=[
        dict(
            active = 0,
            x = -0.1,
            y = 0.7,
            buttons = [
                dict(label = variable,
                     method = "update",
                     args = [{"visible": [i == j for i in range(len(variables))]},
                             {"title.text": f"{titles[j]} by Admissions Decision",
                              "xaxis.title.text": "Admissions Decision",
                              "yaxis.title.text": titles[j]}])
                for j, variable in enumerate(variables)
            ],
        )
    ],
    title_text = f"{titles[0]} by Admissions Decision",
    # The category order set here persists when a button is clicked
    xaxis = dict(title = dict(text = "Admissions Decision"), categoryorder = "array", categoryarray = order),
    yaxis = dict(title = dict(text = titles[0])),
    showlegend = True,
    margin = dict(l = 50, r = 50, t = 50, b = 50)
)

plot.show()
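
The dropdown works by giving each button a boolean mask over the five traces, so that exactly one bar trace is visible at a time. The short sketch below reproduces the masks generated by the list comprehension in the button definitions. The helper name `visibility_mask` is ours and is used only for illustration.

```python
variables = ["GPA", "Test Score", "Writing Score", "Work Experience", "Volunteer Level"]

def visibility_mask(selected, n_traces):
    """Return a list that is True only at the index of the selected trace."""
    return [i == selected for i in range(n_traces)]

# One mask per dropdown button, keyed by the button label
masks = {name: visibility_mask(j, len(variables)) for j, name in enumerate(variables)}

for name, mask in masks.items():
    print(f"{name:16s} {mask}")
```

Selecting "Writing Score", for example, applies a mask that hides every trace except the third one.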

Median Trends by Metric

Below is an interactive bar plot displaying the median of each numerical metric (GPA, Test Score, etc.) by internship admissions decision. To switch between metrics, open the dropdown menu and select the metric of your choice.

Code
pio.renderers.default = "plotly_mimetype+notebook"

# Add traces
plot = go.Figure(data=[
    go.Bar( 
        name = "GPA", 
        x = medians["Decision"], 
        y = medians["GPA"],
        marker_color = "#0E6BA8"
    ), 
    go.Bar( 
        name = "Test Score", 
        x = medians["Decision"], 
        y = medians["Test Score"],
        marker_color = "#6F0624",
        visible = False 
    ), 
    go.Bar( 
        name = "Writing Score", 
        x = medians["Decision"], 
        y = medians["Writing Score"],
        marker_color = "#8B748F",
        visible = False  
    ), 
    go.Bar( 
        name = "Work Experience", 
        x = medians["Decision"], 
        y = medians["Work Experience"],
        marker_color = "#00072D",
        visible = False 
    ), 
    go.Bar( 
        name = "Volunteer Level", 
        x = medians["Decision"], 
        y = medians["Volunteer Level"],
        marker_color = "#0A2472",
        visible = False
    ) 
]) 

# List of titles to use
titles = ["Median GPA", "Median Test Score", "Median Writing Score", "Median Years of Work Experience", "Median Volunteer Level"]

# Dropdown: each button shows exactly one trace and updates the titles.
# Button args are passed straight to plotly.js, so layout keys use dotted
# paths ("yaxis.title.text") rather than Python-only shortcuts ("yaxis_title").
plot.update_layout(
    updatemenus=[
        dict(
            active = 0,
            x = -0.1,
            y = 0.7,
            buttons = [
                dict(label = variable,
                     method = "update",
                     args = [{"visible": [i == j for i in range(len(variables))]},
                             {"title.text": f"{titles[j]} by Admissions Decision",
                              "xaxis.title.text": "Admissions Decision",
                              "yaxis.title.text": titles[j]}])
                for j, variable in enumerate(variables)
            ],
        )
    ],
    title_text = f"{titles[0]} by Admissions Decision",
    # The category order set here persists when a button is clicked
    xaxis = dict(title = dict(text = "Admissions Decision"), categoryorder = "array", categoryarray = order),
    yaxis = dict(title = dict(text = titles[0])),
    showlegend = True,
    margin = dict(l = 50, r = 50, t = 50, b = 50)
)

plot.show()

Demographic Insights

The following section breaks down the relationship between demographics (gender & state) and internship application decisions.

Below is the number of students for each combination of state and decision. Note that most combinations contain only a handful of students, so the state-level analysis that follows may not be representative of the wider student population.

Code
import pandas as pd
import altair as alt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import matplotlib.pyplot as plt

from vega_datasets import data

df = pd.read_csv('../data/clean_data.csv')

# Count students for each decision and state
decision_count = df.groupby(['Decision', 'State']).size().reset_index(name='Count')

Decision   State         Count
Admit      California        9
Admit      Colorado          8
Admit      Florida          11
Admit      Utah              1
Decline    California        1
Decline    Colorado          6
Decline    Florida          13
Decline    Mississippi       1
Decline    Oregon            1
Decline    Utah              2
Decline    Virginia          4
Waitlist   Alabama           1
Waitlist   California        2
Waitlist   Colorado          4
Waitlist   Florida          11
Waitlist   New York          1
Waitlist   Utah              3
Waitlist   Vermont           1
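
To make the small-sample caveat concrete, the sketch below totals the counts from the table by decision and by state and lists the states with very few students. The rows are copied from the table; the cutoff of fewer than 5 students is our own choice for illustration, not a threshold used by the portal.

```python
from collections import defaultdict

# (decision, state, count) rows copied from the table above
rows = [
    ("Admit", "California", 9), ("Admit", "Colorado", 8),
    ("Admit", "Florida", 11), ("Admit", "Utah", 1),
    ("Decline", "California", 1), ("Decline", "Colorado", 6),
    ("Decline", "Florida", 13), ("Decline", "Mississippi", 1),
    ("Decline", "Oregon", 1), ("Decline", "Utah", 2), ("Decline", "Virginia", 4),
    ("Waitlist", "Alabama", 1), ("Waitlist", "California", 2),
    ("Waitlist", "Colorado", 4), ("Waitlist", "Florida", 11),
    ("Waitlist", "New York", 1), ("Waitlist", "Utah", 3), ("Waitlist", "Vermont", 1),
]

by_decision = defaultdict(int)
by_state = defaultdict(int)
for decision, state, count in rows:
    by_decision[decision] += count
    by_state[state] += count

total = sum(by_decision.values())

# States with fewer than 5 students in total (illustrative cutoff)
sparse_states = sorted(s for s, n in by_state.items() if n < 5)

print(dict(by_decision))   # students per decision
print(total)               # all students in the cleaned data
print(sparse_states)       # states with very few students
```

The totals come to 29 admitted, 28 declined, and 23 waitlisted students (80 overall), and six of the ten states contribute fewer than five students each.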

Geographics

To provide an overview of the data, we will be looking at the data from a geographic perspective, specifically at the state level.

Code
import plotly.express as px
import pandas as pd

custom_palette = ['#00072D', '#0A2472', '#0E6BA8', '#A6E1FA', '#99ABC5', '#8B748F', '#6F0624']
df_copy = df.copy()
state_abbreviations = {
    'California': 'CA',
    'Florida': 'FL',
    'Colorado': 'CO',
    'Utah': 'UT',
    'Oregon': 'OR',
    'Virginia': 'VA',
    'Mississippi': 'MS',
    'New York': 'NY',
    'Alabama': 'AL',
    'Vermont': 'VT'
}
df_copy['State'] = df_copy['State'].map(state_abbreviations)


state_decision_counts = df_copy.groupby(['State', 'Decision']).size().reset_index(name='Counts')
state_decision_counts['Decision_Counts'] = state_decision_counts['Decision'] + " : " + state_decision_counts['Counts'].astype(str)

hover_data = state_decision_counts.pivot(index='State', columns='Decision', values='Decision_Counts').reset_index()
hover_data['hover_text'] = hover_data.apply(lambda row: ', '.join(row.dropna()[1:]), axis=1)

total_applications = df_copy.groupby('State').size().reset_index(name='Total Applications')
final_data = total_applications.merge(hover_data[['State', 'hover_text']], on='State', how='left')

pio.renderers.default = "plotly_mimetype+notebook"

fig = px.choropleth(final_data,
                    locations='State', 
                    locationmode="USA-states", 
                    color='Total Applications',
                    color_continuous_scale=custom_palette,
                    scope="usa",
                    title="Total Internship Applications by State & Outcome",
                    hover_name='State',
                    hover_data={'State': False, 'Total Applications': True, 'hover_text': True},
                    labels={'hover_text': 'Decisions'}
                   )

fig.show()

Features by State

Code
custom_palette = ['#00072D', '#0A2472', '#0E6BA8', '#A6E1FA', '#99ABC5', '#8B748F', '#6F0624']

#calculate averages of all numeric columns
num_cols = df[['State', 'GPA', 'WorkExp', 'TestScore', 'WritingScore', 'VolunteerLevel']]
avg_df = num_cols.groupby('State').mean().reset_index()
state_abbr = {
    'Alabama': 'AL',
    'California': 'CA',
    'Colorado': 'CO',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Mississippi': 'MS',
    'New York': 'NY',
    'Oregon': 'OR',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA'

}
avg_df['State_Abbr'] = avg_df['State'].map(state_abbr)
avg_df = avg_df.drop(columns=['State'])
avg_df = avg_df.rename(columns={'State_Abbr': 'State'})

pio.renderers.default = "plotly_mimetype+notebook"

fig = px.choropleth(avg_df, locationmode="USA-states", 
                    locations=avg_df['State'], 
                    scope="usa",
                    color=avg_df['GPA'],
                    hover_data={"State": True, "GPA": True},  
                    labels={"GPA": "Selected Variable"},
                    color_continuous_scale=custom_palette
                )

dropdown = []
for col in avg_df.columns[:-1]:
    dropdown.append({'label': col, 'method': 'update', 'args': [{'z': [avg_df[col]]}]})

fig.update_layout(updatemenus=[{'buttons': dropdown, 'direction': 'down', 'showactive': True}],
                  title='Choropleth Map of Average Selected Variable')
fig.update_coloraxes(colorbar_title=dict(text='Selected Variable'))

fig.show()

Above is a choropleth map of the average of each numeric feature (GPA, test score, writing score, work experience in years, and volunteer level) by state. Each average is calculated across all decision types to obtain a holistic view of the student data by state. Below we summarize some findings for each feature:

  • GPA: California has the highest average GPA, with Florida and New York close behind. Oregon and Mississippi have the lowest average GPA.
  • Test Score: California has the highest average test score. Mississippi has the lowest average test score.
  • Writing Score: California has the highest average writing score. New York has the lowest average writing score.
  • Work Experience: Mississippi has the highest average work experience in years. Oregon has the lowest average work experience.
  • Volunteer Level: Oregon has the highest average volunteer level. Alabama has the lowest average volunteer level.

We can also look at some of these features at the geographic level by decision.

Code
admit = df[df['Decision'] == 'Admit']
num_cols = admit[['State', 'GPA', 'WorkExp', 'TestScore', 'WritingScore', 'VolunteerLevel']]
avg_admit = num_cols.groupby('State').mean().reset_index()
avg_admit['State_Abbr'] = avg_admit['State'].map(state_abbr)

decline = df[df['Decision'] == 'Decline']
num_cols = decline[['State', 'GPA', 'WorkExp', 'TestScore', 'WritingScore', 'VolunteerLevel']]
avg_decline = num_cols.groupby('State').mean().reset_index()
avg_decline['State_Abbr'] = avg_decline['State'].map(state_abbr)
Code
import altair as alt
from vega_datasets import data

state_id_dict = dict(zip(data.population_engineers_hurricanes()["state"], data.population_engineers_hurricanes()["id"]))
avg_admit["StateID"] = avg_admit["State"].map(state_id_dict)
avg_decline["StateID"] = avg_decline["State"].map(state_id_dict)

states = alt.topo_feature('https://raw.githubusercontent.com/vega/vega-datasets/master/data/us-10m.json', 'states')
click = alt.selection_multi(fields = ["State"])

existing_states = alt.Chart(states).mark_geoshape(stroke='black').encode(
    color = alt.Color("GPA:Q", scale=alt.Scale(range=custom_palette)),
    tooltip = ["State:N", "GPA:Q"],
    opacity = alt.condition('isValid(datum.GPA)', alt.value(1), alt.value(0.2)),
).transform_lookup(
    lookup = "id",
    from_ = alt.LookupData(avg_admit, "StateID", list(avg_admit.columns))
).properties(width = 333, height = 200, title="Average Admitted GPA by State").add_selection(click).project(type = "albersUsa").interactive()

missing_states = (
    alt.Chart(states)
    .mark_geoshape(fill = "grey", stroke = "white")
    .encode(opacity=alt.condition("isValid(datum.GPA)", alt.value(0), alt.value(0.2))).add_selection(click).project(type = "albersUsa")
    )

admit_gpa = existing_states + missing_states
admit_gpa = admit_gpa.encode(
    tooltip= ["State:N", "GPA:Q"]
    ).transform_lookup(
        lookup="id",
        from_=alt.LookupData(avg_admit, "StateID", list(avg_admit.columns))
    ).interactive()

existing_states = alt.Chart(states).mark_geoshape(stroke='black').encode(
    color = alt.Color("GPA:Q", scale=alt.Scale(range=custom_palette)),
    tooltip = ["State:N", "GPA:Q"],
    opacity = alt.condition('isValid(datum.GPA)', alt.value(1), alt.value(0.2)),
).transform_lookup(
    lookup = "id",
    from_ = alt.LookupData(avg_decline, "StateID", list(avg_decline.columns))
).properties(width = 333, height = 200, title="Average Declined GPA by State").add_selection(click).project(type = "albersUsa").interactive()

missing_states = (
    alt.Chart(states)
    .mark_geoshape(fill = "grey", stroke = "white")
    .encode(opacity=alt.condition("isValid(datum.GPA)", alt.value(0), alt.value(0.2))).add_selection(click).project(type = "albersUsa")
    )

decline_gpa = existing_states + missing_states
decline_gpa = decline_gpa.encode(
    tooltip= ["State:N", "GPA:Q"]
    ).transform_lookup(
        lookup="id",
        from_=alt.LookupData(avg_decline, "StateID", list(avg_decline.columns))
    ).interactive()

admit_gpa | decline_gpa
Code
existing_states = alt.Chart(states).mark_geoshape(stroke='black').encode(
    color = alt.Color("TestScore:Q", scale=alt.Scale(range=custom_palette)),
    tooltip = ["State:N", "TestScore:Q"],
    opacity = alt.condition('isValid(datum.TestScore)', alt.value(1), alt.value(0.2)),
).transform_lookup(
    lookup = "id",
    from_ = alt.LookupData(avg_admit, "StateID", list(avg_admit.columns))
).properties(width = 333, height = 200, title="Average Admitted Test Score by State").add_selection(click).project(type = "albersUsa").interactive()

missing_states = (
    alt.Chart(states)
    .mark_geoshape(fill = "grey", stroke = "white")
    .encode(opacity=alt.condition("isValid(datum.TestScore)", alt.value(0), alt.value(0.2))).add_selection(click).project(type = "albersUsa")
    )

admit_test = existing_states + missing_states
admit_test = admit_test.encode(
    tooltip= ["State:N", "TestScore:Q"]
    ).transform_lookup(
        lookup="id",
        from_=alt.LookupData(avg_admit, "StateID", list(avg_admit.columns))
    ).interactive()

existing_states = alt.Chart(states).mark_geoshape(stroke='black').encode(
    color = alt.Color("TestScore:Q", scale=alt.Scale(range=custom_palette)),
    tooltip = ["State:N", "TestScore:Q"],
    opacity = alt.condition('isValid(datum.TestScore)', alt.value(1), alt.value(0.2)),
).transform_lookup(
    lookup = "id",
    from_ = alt.LookupData(avg_decline, "StateID", list(avg_decline.columns))
).properties(width = 333, height = 200, title="Average Declined Test Score by State").add_selection(click).project(type = "albersUsa").interactive()

missing_states = (
    alt.Chart(states)
    .mark_geoshape(fill = "grey", stroke = "white")
    .encode(opacity=alt.condition("isValid(datum.TestScore)", alt.value(0), alt.value(0.2))).add_selection(click).project(type = "albersUsa")
    )

decline_test = existing_states + missing_states
decline_test = decline_test.encode(
    tooltip= ["State:N", "TestScore:Q"]
    ).transform_lookup(
        lookup="id",
        from_=alt.LookupData(avg_decline, "StateID", list(avg_decline.columns))
    ).interactive()

admit_test | decline_test

As the average GPAs and test scores by state show, students who were admitted generally had higher GPAs and test scores than those who were declined.

This insight suggests that helping students raise their test scores may increase their chances of being admitted to an internship.

Decision Rates by State

We can also compare decision rates by state, that is, how the admitted and declined students are distributed across states, to see how students from each state fare overall.

Code
# Count students per decision and state, then express each count as a
# percentage of all students who received that decision
# (e.g., the share of all admitted students who come from Florida)
decision_state = df.groupby(['Decision', 'State'])[["GPA"]].count().reset_index()
decision_state = decision_state.rename(columns={'GPA':'StateCount'})
decision_state['DecisionCount'] = decision_state.groupby('Decision')['StateCount'].transform('sum')
decision_state['Rate'] = decision_state['StateCount'] / decision_state['DecisionCount'] * 100

state_id_dict = dict(zip(data.population_engineers_hurricanes()["state"], data.population_engineers_hurricanes()["id"]))
decision_state["StateID"] = decision_state["State"].map(state_id_dict)

admit_states = decision_state[decision_state['Decision'] == "Admit"]
decline_states = decision_state[decision_state['Decision'] == "Decline"]
Code
states = alt.topo_feature('https://raw.githubusercontent.com/vega/vega-datasets/master/data/us-10m.json', 'states')
click = alt.selection_multi(fields = ["State"])

existing_states = alt.Chart(states).mark_geoshape(stroke='black').encode(
    color = alt.Color("Rate:Q", scale=alt.Scale(range=custom_palette)),
    tooltip = ["State:N", "Rate:Q"],
    opacity = alt.condition('isValid(datum.Rate)', alt.value(1), alt.value(0.2)),
).transform_lookup(
    lookup = "id",
    from_ = alt.LookupData(admit_states, "StateID", list(admit_states.columns))
).properties(width = 333, height = 200, title="Admission Rates by State").add_selection(click).project(type = "albersUsa").interactive()

missing_states = (
    alt.Chart(states)
    .mark_geoshape(fill = "grey", stroke = "white")
    .encode(opacity=alt.condition("isValid(datum.Rate)", alt.value(0), alt.value(0.2))).add_selection(click).project(type = "albersUsa")
    )

admit_map = existing_states + missing_states
admit_map = admit_map.encode(
    tooltip= ["State:N", "Rate:Q"]
    ).transform_lookup(
        lookup="id",
        from_=alt.LookupData(admit_states, "StateID", list(admit_states.columns))
    ).interactive()

existing_states = alt.Chart(states).mark_geoshape(stroke='black').encode(
    color = alt.Color("Rate:Q", scale=alt.Scale(range=custom_palette)),
    tooltip = ["State:N", "Rate:Q"],
    opacity = alt.condition('isValid(datum.Rate)', alt.value(1), alt.value(0.2)),
).transform_lookup(
    lookup = "id",
    from_ = alt.LookupData(decline_states, "StateID", list(decline_states.columns))
).properties(width = 333, height = 200, title="Rejection Rates by State").add_selection(click).project(type = "albersUsa").interactive()

missing_states = (
    alt.Chart(states)
    .mark_geoshape(fill = "grey", stroke = "white")
    .encode(opacity=alt.condition("isValid(datum.Rate)", alt.value(0), alt.value(0.2))).add_selection(click).project(type = "albersUsa")
    )

decline_map = existing_states + missing_states
decline_map = decline_map.encode(
    tooltip= ["State:N", "Rate:Q"]
    ).transform_lookup(
        lookup="id",
        from_=alt.LookupData(decline_states, "StateID", list(decline_states.columns))
    ).interactive()

admit_map | decline_map

Above are maps showing each state’s share of all admitted students and each state’s share of all declined students. Each rate is the percentage of students with a given decision who come from that state, not the percentage of a state’s students who received that decision. Some findings from the maps are:

  • Florida accounts for the largest share of admitted students.
  • Utah accounts for the smallest share of admitted students.
  • Florida also accounts for the largest share of declined students.
  • California, Oregon, and Mississippi are tied for the smallest share of declined students.

Florida leads both maps largely because it contributes the most students overall. The maps show no clear relationship between a student’s state and the outcome, so they offer no evidence that a student’s home state is pivotal to the internship decision.
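
Because the rates on the maps are shares of each decision group, it can help to set them beside the share of each state’s own students who were admitted. The sketch below does this for the four states with admitted students, using counts copied from the students-per-state table. The within-state rate is an additional view that we compute here for illustration; it is not shown on the portal’s maps.

```python
# Admitted students and total students per state, copied from the students-per-state table
admitted = {"California": 9, "Colorado": 8, "Florida": 11, "Utah": 1}
students = {"California": 12, "Colorado": 18, "Florida": 35, "Utah": 6}

total_admitted = sum(admitted.values())

# Share of all admits that come from each state (the quantity shown on the maps)
share_of_admits = {s: 100 * n / total_admitted for s, n in admitted.items()}

# Share of each state's own students who were admitted (additional view)
within_state_rate = {s: 100 * admitted[s] / students[s] for s in admitted}

for state in admitted:
    print(f"{state:11s} share of admits: {share_of_admits[state]:5.1f}%   "
          f"within-state admit rate: {within_state_rate[state]:5.1f}%")
```

Florida supplies about 38% of all admits, yet only 11 of its 35 students (about 31%) were admitted, whereas 9 of California’s 12 students (75%) were admitted. With counts this small, both views should be read with caution.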

Gender

It is important to establish that internship opportunities are given fairly and equitably to all students regardless of gender. Analyzing decisions by gender can highlight any discrepancies or biases in the selection process, which is the primary focus of the following section.

Code
%%html
<figure>
    <img src="../website/images/heatmap.png"  style="display: block; margin-left: auto; margin-right: auto;">
    <figcaption style="text-align: center;">Figure 1: Heatmap of student internship decisions by gender to highlight any biases that may appear.</figcaption>
</figure>
Figure 1: Heatmap of student internship decisions by gender to highlight any biases that may appear.

The heatmap above displays internship decision counts among females and males. Because the colors represent counts in very similar ranges, it appears gender is not a contributing factor to internship decisions. However, it is important to back that statement with a statistical test, such as a Chi-Square test. Because the data set is relatively small, a Chi-Square test may not always be accurate, so Fisher’s Exact Test was also performed to verify its result.

Code
from scipy.stats import chi2_contingency

cont_table = pd.crosstab(df["Gender"], df["Decision"])
chi2_stat, p_value_chi2, _, _ = chi2_contingency(cont_table)

Statistic - Chi-Square Test    Value
Chi-Square Statistic           0.0234
Chi-Square P-value             0.9883

The Chi-Square statistic measures the difference between the observed gender frequencies in each decision category and the frequencies that would be expected if there were no association between gender and decision. A low Chi-Square value, as seen above, indicates that the observed frequencies are very close to the expected frequencies. With a P-value far above 0.05, we fail to reject the hypothesis of no association between gender and admission decisions, so this test shows no evidence of gender bias, which is the desired result.
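
The calculation behind the statistic can be sketched in a few lines. The function below computes the Pearson Chi-Square statistic from observed and expected counts. The example table is hypothetical and is not the portal’s gender data. For a table of two genders by three decisions there are (2 - 1) × (3 - 1) = 2 degrees of freedom, in which case the upper-tail p-value has the closed form exp(-statistic / 2). We assume two gender categories, as the heatmap description suggests.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

def p_value_two_df(statistic):
    """Upper-tail p-value of a chi-square statistic with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)

# Hypothetical gender-by-decision counts (columns: Admit, Decline, Waitlist).
# These numbers are made up for illustration and are NOT the portal's data.
hypothetical = [
    [15, 14, 12],
    [14, 14, 11],
]

stat = chi_square_statistic(hypothetical)
print(round(stat, 4), round(p_value_two_df(stat), 4))

# The closed form applied to the statistic reported above (0.0234)
print(round(p_value_two_df(0.0234), 3))
```

Applying the closed form to the reported statistic of 0.0234 gives a p-value of about 0.988, in line with the value in the table.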

Code
from scipy.stats import fisher_exact
cont_table_small = cont_table[["Admit", "Decline"]]
odds_ratio, p_value_fish = fisher_exact(cont_table_small)

Statistic - Fisher’s Exact Test    Value
Fisher’s Odds Ratio                1.0833
Fisher’s Test P-Value              1.0

For Fisher’s Exact Test, which was run on the 2×2 table of gender by Admit and Decline outcomes, the odds ratio compares the odds of an event occurring in one group with the odds in the other. An odds ratio greater than 1 indicates a positive association. The odds ratio of 1.083 means that the odds of being admitted are 8.3% higher for one gender group than for the other. Although the odds differ slightly in the sample, the test’s p-value is equal to 1, which is higher than any conventional significance level, so the difference is not statistically significant. This agrees with the result of the Chi-Square test, indicating no evidence of an association between gender and admissions decisions.
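
As a small worked illustration of the odds ratio, the sketch below computes the sample odds ratio (a × d) / (b × c) for a 2×2 table and converts an odds ratio into a percentage difference in odds. The table is hypothetical and is not the portal’s data. We assume this is the same sample odds ratio that `fisher_exact` reports by default.

```python
def odds_ratio(table):
    """Sample odds ratio (a*d)/(b*c) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def percent_higher_odds(ratio):
    """Express an odds ratio as the percentage by which one group's odds exceed the other's."""
    return (ratio - 1) * 100

# Hypothetical counts, NOT the portal's data.
# Rows are two groups; columns are Admit and Decline.
hypothetical = [[20, 10],
                [15, 15]]

ratio = odds_ratio(hypothetical)
print(ratio, percent_higher_odds(ratio))

# Interpreting the odds ratio reported above (1.0833)
print(round(percent_higher_odds(1.0833), 1))
```

In the hypothetical table the first group’s odds of admission (20 to 10) are twice those of the second group (15 to 15). The reported ratio of 1.0833 corresponds to odds that are about 8.3% higher.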

Academic & Holistic Insights

Through the use of pairplots and machine learning techniques, relationships between students’ academic features (GPA, writing score, test score) and the internship application outcome can be understood. This information can help students at Data Tech understand which features of their application may contribute to internship decisions, and how strongly. Furthermore, this analysis can provide insight into areas of the curriculum that may or may not need targeted attention to improve students’ internship admission chances.

Code
%%html
<img src="../website/images/decision_pairplot.png">

Above is a pairplot of GPA, writing score, and test score of the students grouped by the decision. When looking at the scatterplots, we notice some patterns:

  • Students with a low test score, no matter the GPA, were declined.
  • Students with a high test score and a high GPA were admitted.
  • Students with a fairly high GPA but an average test score were waitlisted.
  • Students with a high test score, no matter the writing score, were admitted.
  • Students with a low test score, no matter the writing score, were declined.
  • Students with a high writing score but an average test score were waitlisted.

The pairplot makes it apparent that some academic features are related to the decision outcome, and that some features appear to matter more than others.

To understand the factors influencing college student internship decisions, we employed a tree-based machine learning model, specifically XGBoost, to analyze the data. To interpret the model’s predictions and assess the impact of each factor, we utilized Shapley values, a concept from cooperative game theory. This analysis enables us to identify which factors most strongly influence internship decisions. Gaining insights into these relationships will help us pinpoint areas for improvement in the college curriculum, ensuring that students are well-prepared and have the highest likelihood of securing summer internships.
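
To give a feel for what a Shapley value is, the sketch below computes exact Shapley values for a toy scoring function by averaging each feature’s marginal contribution over every possible ordering of the features. The scoring function and its weights are invented for illustration. It is not the portal’s XGBoost model, and SHAP tooling for tree models uses much more efficient algorithms than this brute-force enumeration.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for p in ordering:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

def toy_score(features):
    """Hypothetical score built from made-up weights (illustration only)."""
    score = 0.0
    if "TestScore" in features:
        score += 0.5
    if "GPA" in features:
        score += 0.2
    if "TestScore" in features and "GPA" in features:
        score += 0.1  # interaction bonus shared by the two features
    if "WorkExp" in features:
        score += 0.05
    return score

phi = shapley_values(["TestScore", "GPA", "WorkExp"], toy_score)
for feature, contribution in phi.items():
    print(f"{feature:10s} {contribution:.3f}")
```

The contributions add up to the score of the full feature set, and the interaction bonus is split evenly between the two features that create it. This additivity is what allows per-feature SHAP values to be compared directly.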

Code
%%html
<figure>
    <img src="../website/images/shap.png"  style="display: block; margin-left: auto; margin-right: auto;">
    <figcaption style="text-align: center;">Figure 2: Visualization of SHAP values indicating the overall impact of various features on internship decisions.</figcaption>
</figure>
Figure 2: Visualization of SHAP values indicating the overall impact of various features on internship decisions.

The figure above quantifies the overall influence of students’ academic and holistic attributes on internship application outcomes. Features with larger SHAP values have a greater impact on internship decisions. Conversely, features with smaller SHAP values are less important in the internship decision-making process.

As we can see, test scores, GPA, and writing scores are among the top contributors, while features such as work experience and volunteer level are not weighed as heavily. This suggests that companies focus mainly on a student’s academic background, rather than holistic attributes, when making their decision.

Code
%%html
<figure>
    <img src="../website/images/spec_shap.png"  style="display: block; margin-left: auto; margin-right: auto;">
    <figcaption style="text-align: center;">Figure 3: Visualization of SHAP values indicating the impact of various features on internship application outcomes by decision.</figcaption>
</figure>
Figure 3: Visualization of SHAP values indicating the impact of various features on internship application outcomes by decision.

While the previous plot displayed how student attributes play a role in the internship decision-making process overall, Shapley values also allow us to analyze how each feature contributes to each possible decision (Admit, Waitlist, Decline).

The figure above displays just that. It appears that the test score is the most significant factor contributing to a student’s likelihood of being admitted. It falls in line with the earlier pairplot as we saw significant overlap in internship decisions among GPAs and writing scores, but distinct separation for test scores. This suggests that students with higher test scores have a greater advantage in the competitive internship landscape.

For students who are placed on a waitlist, both test scores and GPA are important considerations. This might indicate that students on the waitlist have comparatively lower test scores than those who are admitted outright. The pairplot displays exactly that so it seems that for these students, academic performance is a deciding factor that could tip the balance in their favor for internship admission.

On the other hand, for students who are declined, the test score still holds considerable weight, but the writing score becomes notably more influential. This pattern could imply that declined students, while possibly having adequate test scores, may fall short in demonstrating the necessary writing proficiency, which is critical for many internships that require strong communication skills.

Conclusions

These insights suggest three strategic focuses for the institution:

Test Score Improvement: Continue to prioritize and enhance test preparation services, ensuring that the students can achieve the highest scores possible.

Academic Support: Given the importance of GPA, particularly for waitlisted students, bolstering academic support can help these students improve their standing and increase their chances of moving from waitlist to admit.

Writing Proficiency: To address the writing skills that affect both waitlisted and declined students, consider expanding the writing centers and integrating more communication-focused workshops into student services.

By concentrating on these areas, the college will help its students to not only meet but exceed the expectations of internship programs, thereby improving their chances of being admitted.
